# Install and load packages 
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
devtools::install_github("ebenmichael/augsynth")
## Using GitHub PAT from the git credential store.
## Skipping install of 'augsynth' from a github remote, the SHA1 (982f650b) has not changed since last install.
##   Use `force = TRUE` to force installation
pacman::p_load(# Tidyverse packages including dplyr and ggplot2 
               tidyverse,
               ggthemes,
               augsynth,
               gsynth)

# set seed
set.seed(44)

# load data
medicaid_expansion <- read_csv('./data/medicaid_expansion.csv')
## Rows: 663 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): State
## dbl  (3): year, uninsured_rate, population
## date (1): Date_Adopted
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction

For this project, you will explore the question of whether the Affordable Care Act increased health insurance coverage (or conversely, decreased the number of people who are uninsured). The ACA was passed in March 2010, but several of its provisions were phased in over a few years. The ACA instituted the “individual mandate” which required that all Americans must carry health insurance, or else suffer a tax penalty. There are four mechanisms for how the ACA aims to reduce the uninsured population:

In 2012, the Supreme Court heard the landmark case NFIB v. Sebelius, which principally challenged the constitutionality of the law under the theory that Congress could not institute an individual mandate. The Supreme Court ultimately upheld the individual mandate under Congress’s taxation power, but struck down the requirement that states must expand Medicaid as impermissible subordination of the states to the federal government. Subsequently, several states refused to expand Medicaid when the program began on January 1, 2014. This refusal created the “Medicaid coverage gap,” in which some individuals earn too much to qualify for Medicaid under the old standards, but too little to qualify for the ACA subsidies targeted at middle-income individuals.

States that refused to expand Medicaid cited cost as the primary factor. Critics pointed out, however, that the decision not to expand broke down largely along partisan lines. In the years since the initial expansion, several states have opted into the program, either because of a change in the governing party or because voters directly approved expansion via a ballot initiative.

You will explore the question of whether Medicaid expansion reduced the uninsured population in the U.S. in the 7 years since it went into effect. To address this question, you will use difference-in-differences estimation, and synthetic control.

Data

The dataset you will work with has been assembled from a few different sources about Medicaid. The key variables are:

Exploratory Data Analysis

Create plots and provide 1-2 sentence analyses to answer the following questions:

# highest and lowest uninsured rates
pre_2014 <- medicaid_expansion %>% filter(year < 2014)

state_avg <- pre_2014 %>%
  group_by(State) %>%
  summarize(avg_uninsured = mean(uninsured_rate, na.rm = TRUE)) %>%
  arrange(avg_uninsured)


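lowest_state <- state_avg$State[1]
# Caution: state_avg is sorted ascending, so nrow(state_avg) - 1 indexes the
# second-to-last row; use nrow(state_avg) if the single highest average is intended.
highest_state <- state_avg$State[nrow(state_avg)-1]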


pre_2014 <- pre_2014 %>%
  mutate(highlight = case_when(
    State %in% lowest_state ~ "Lowest(avg)",
    State %in% highest_state ~ "Highest(avg)",
    TRUE ~ "Other"
  ))


ggplot(pre_2014, aes(x = year, y = uninsured_rate, group = State, 
                     color = highlight, alpha = highlight, size = highlight)) +
  geom_line() +
  geom_point(data = pre_2014 %>% filter(highlight != "Other"), 
             aes(shape = State),
             color = "black") +
  scale_color_manual(values = c("Highest(avg)" = "red", 
                                "Lowest(avg)" = "steelblue", 
                                "Other" = "gray30")) +
  scale_alpha_manual(values = c("Highest(avg)" = 1, "Lowest(avg)" = 1, "Other" = .2)) +
  scale_size_manual(values = c("Highest(avg)" = 1.2, "Lowest(avg)" = 1.2, "Other" = .5)) +
  scale_shape_manual(values = c(16, 17, 18, 19)) + 
  scale_x_continuous(breaks = unique(pre_2014$year)) +
  labs(
    title = "Uninsured Rate Trends by State (2008-2013)",
    subtitle = "Highlighting states with highest and lowest average uninsured rates",
    x = "Year",
    y = "Uninsured Rate",
    color = "State Group",
    shape = "State"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    legend.box = "vertical"
  ) +
  guides(
    alpha = "none",
    size = "none"
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(pre_2014, 
       aes(x = year, y = uninsured_rate)) +
  geom_point() +
  geom_line() +
  facet_wrap(~State) +   
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    legend.box = "vertical",
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 5)
  ) +
  ggtitle('Uninsured Rate (2008-2013)') +
  xlab('Time') +
  ylab('Uninsured Rate')

medicaid_expansion %>% filter(year == 2008) %>% 
  filter(uninsured_rate == max(uninsured_rate))
## # A tibble: 1 × 5
##   State Date_Adopted  year uninsured_rate population
##   <chr> <date>       <dbl>          <dbl>      <dbl>
## 1 Utah  2020-01-01    2008          0.241    2942902
medicaid_expansion %>% filter(year == 2013) %>% 
  filter(uninsured_rate == max(uninsured_rate))
## # A tibble: 1 × 5
##   State Date_Adopted  year uninsured_rate population
##   <chr> <date>       <dbl>          <dbl>      <dbl>
## 1 Texas NA            2013          0.220   26956958

Between 2008 and 2013, Florida had the highest average uninsured rate, while Massachusetts had the lowest, likely because of the universal healthcare law it passed in 2006. During this period, the uninsured rates in these two states remained relatively stable, in contrast to the larger fluctuations observed in some other states. For example, Utah had the highest uninsured rate in 2008, but its rate had declined by 2010. Meanwhile, Texas experienced a sharp increase in its uninsured rate in 2010 and had the highest rate thereafter.
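The ranking step behind this comparison can be sketched on a toy table. The state labels and averages below are made up for illustration and are not taken from `medicaid_expansion.csv`. The sketch uses `which.min()` and `which.max()`, which skip missing values, so the result does not depend on where `NA` averages land after sorting.

```r
# Hypothetical pre-2014 averages (illustrative values, not the real data)
toy_avg <- data.frame(
  State = c("A", "B", "C", "D"),
  avg_uninsured = c(0.12, NA, 0.21, 0.05)
)

# which.min() / which.max() ignore NA entries
lowest_state  <- toy_avg$State[which.min(toy_avg$avg_uninsured)]
highest_state <- toy_avg$State[which.max(toy_avg$avg_uninsured)]

print(c(lowest = lowest_state, highest = highest_state))
```

Selecting by position after `arrange()` gives the same answer only when the intended row really is first or last, which is worth confirming when some averages are missing.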

# most uninsured Americans
uninsured_population <- medicaid_expansion %>%
  mutate(uninsured_pop = uninsured_rate * population) 

uninsured_population_pre2014 <- uninsured_population %>% 
  filter(year < 2014) %>%
  group_by(State) %>%
  summarize(avg_uninsured_pop = mean(uninsured_pop, na.rm = TRUE)) %>%
  arrange(avg_uninsured_pop)


lowest_state_pre2014 <- uninsured_population_pre2014$State[1]
highest_state_pre2014 <- uninsured_population_pre2014$State[nrow(state_avg)-1]

uninsured_population_2020 <- uninsured_population %>% 
  filter(year == 2020) %>%
  group_by(State) %>%
  arrange(uninsured_pop)

lowest_state_2020 <- uninsured_population_2020$State[1]
highest_state_2020 <- uninsured_population_2020$State[nrow(state_avg)-1]

uninsured_population <- uninsured_population %>%
  mutate(highlight = case_when(
    State %in% lowest_state_pre2014 ~ "Lowest pre 2014 (avg)",
    State %in% highest_state_pre2014 ~ "Highest pre 2014 (avg)",
    State %in% lowest_state_2020 ~ "Lowest 2020",
    State %in% highest_state_2020 ~ "Highest 2020",
    TRUE ~ "Other"
  ))


ggplot(uninsured_population, aes(x = year, y = uninsured_pop, group = State, 
                     color = highlight, alpha = highlight, size = highlight)) +
  geom_line() +
  geom_point(data = uninsured_population %>% filter(highlight != "Other"), 
             aes(shape = State),
             color = "grey30") +
  scale_color_manual(values = c("Highest pre 2014 (avg)" = "red",
                                "Lowest pre 2014 (avg)" = "steelblue",
                                "Highest 2020" = "brown4",
                                "Lowest 2020" = "blue4",
                                "Other" = "gray60")) +
  scale_alpha_manual(values = c("Highest pre 2014 (avg)" = 1, 
                                "Lowest pre 2014 (avg)" = 1, 
                                "Highest 2020" = 1,
                                "Lowest 2020" = 1,
                                "Other" = 0.2)) +
  scale_size_manual(values = c("Highest pre 2014 (avg)" = 1.2,
                               "Lowest pre 2014 (avg)" = 1.2, 
                               "Highest 2020" = 1.2,
                                "Lowest 2020" = 1.2,
                               "Other" = 0.5)) +
  scale_shape_manual(values = c(16, 17, 18, 19)) + 
  scale_x_continuous(breaks = unique(uninsured_population$year)) +
  labs(
    title = "Uninsured Population Trends by State (2008-2020)",
    x = "Year",
    y = "Uninsured Population",
    color = "State Group",
    shape = "State"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    legend.box = "vertical"
  ) +
  guides(
    alpha = "none",
    size = "none"
  )
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).

ggplot(uninsured_population, 
       aes(x = year, y = uninsured_pop)) +
  geom_point(size = .5) +
  geom_line() +
  facet_wrap(~State, scales = "free_y") +   
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    legend.box = "vertical",
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 5)
  ) +
  ggtitle('Uninsured Population (2008-2020)') +
  xlab('Time') +
  ylab('Uninsured Population')
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

Before 2014, California had the largest uninsured population, but after 2014, Texas took the lead. Vermont consistently had the smallest uninsured population from 2008 to 2020. Across nearly all U.S. states, the number of uninsured individuals declined significantly during this period, with the most pronounced reductions occurring around 2014, coinciding with the expansion of Medicaid. However, some states experienced a slight increase in their uninsured populations toward the end of the period.

Difference-in-Differences Estimation

Estimate Model

Do the following:

  • Choose a state that adopted the Medicaid expansion on January 1, 2014 and a state that did not. Hint: Do not pick Massachusetts as it passed a universal healthcare law in 2006, and also avoid picking a state that adopted the Medicaid expansion between 2014 and 2015.
  • Assess the parallel trends assumption for your choices using a plot. If you are not satisfied that the assumption has been met, pick another state and try again (but detail the states you tried).
# Parallel Trends plot
# Arkansas vs Mississippi

medicaid_expansion %>%
  filter(State %in% c("Arkansas","Tennessee","Mississippi", "Alabama"),
         year == 2014) 
## # A tibble: 4 × 5
##   State       Date_Adopted  year uninsured_rate population
##   <chr>       <date>       <dbl>          <dbl>      <dbl>
## 1 Alabama     NA            2014          0.120    4849377
## 2 Arkansas    2014-01-01    2014          0.118    2994079
## 3 Mississippi NA            2014          0.145    2984926
## 4 Tennessee   NA            2014          0.119    6549352
medicaid_expansion %>%
  filter(State %in% c("Arkansas","Tennessee","Mississippi", "Alabama")) %>%
  
  ggplot() + 
  geom_point(aes(x = year, 
                 y = uninsured_rate, 
                 color = State)) +
  
  geom_line(data = . %>% filter(State == "Arkansas"),
            aes(x = year,
                y = uninsured_rate,
                color = State),
            linewidth = 1) +
  
  geom_line(data = . %>% filter(State != "Arkansas"),
            aes(x = year,
                y = uninsured_rate,
                color = State),
            linewidth = .5) +
  
  geom_vline(aes(xintercept = 2014)) +
  
  scale_x_continuous(breaks = unique(medicaid_expansion$year)) +
  
  labs(
    title = "Uninsured Rate before/after Medicaid",
    subtitle = "Treat: Arkansas \n Control: Tennessee, Mississippi, and Alabama",
    x = "Year",
    y = "Uninsured Rate") +
  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray90"),
    legend.box = "vertical"
  ) +
  guides(
    alpha = "none",
    size = "none"
  )

Arkansas expanded Medicaid in 2014, whereas Alabama, Mississippi, and Tennessee did not. These states share certain similarities—such as geographic proximity, aspects of economic structure, poverty rates, and the proportion of rural populations—which makes Alabama, Mississippi, and Tennessee potential control units for comparison with Arkansas. Among them, Mississippi appears to exhibit the most consistent parallel trend with Arkansas, although it is not a perfect match. Except for 2008, the difference in uninsured rates between the two states remained relatively stable over time.

  • Estimate a difference-in-differences estimate of the effect of the Medicaid expansion on the uninsured share of the population. You may follow the lab example where we estimate the differences in one pre-treatment and one post-treatment period, or take an average of the pre-treatment and post-treatment outcomes.
# Difference-in-Differences estimation

did_test <- medicaid_expansion %>% 
  filter(State %in% c("Arkansas", "Mississippi")) %>%
  select(State, year, uninsured_rate) %>%
  pivot_wider(names_from = State, values_from = uninsured_rate) %>%
  mutate(gap = Arkansas - Mississippi)

baseline_gap <- did_test %>% filter(year == 2013) %>% pull(gap)
did_test <- did_test %>%
  mutate(did_effect = gap - baseline_gap)

print(did_test[did_test$year >= 2014, c("year", "did_effect")])
## # A tibble: 7 × 2
##    year did_effect
##   <dbl>      <dbl>
## 1  2014    -0.0182
## 2  2015    -0.0222
## 3  2016    -0.0287
## 4  2017    -0.0321
## 5  2018    -0.0311
## 6  2019    -0.0296
## 7  2020    -0.0304
ggplot(did_test, aes(x = year, y = did_effect)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_point(size = 2, color = "steelblue") +
  
  geom_vline(xintercept = 2014, linetype = "dashed") +
  geom_hline(yintercept = 0, linetype = "dotted") +
  
  scale_x_continuous(breaks = unique(did_test$year)) +
  labs(title = "Event Study: Effect of Medicaid Expansion (Arkansas vs Mississippi)",
       subtitle = "Baseline Year = 2013",
       y = "Difference-in-Difference Effect",
       x = "Year") +
  
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        plot.subtitle = element_text(hjust = 0.5, size = 11))

The difference-in-differences estimate suggests that Medicaid expansion reduced the uninsured rate in Arkansas by approximately 1.8 percentage points in 2014, measured against Mississippi and the 2013 baseline gap. The estimated effect persisted throughout the observation period: it grew between 2014 and 2017 to roughly 3 percentage points and remained near that level through 2020.
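The assignment also allows averaging the pre-treatment and post-treatment outcomes instead of using a single baseline year. A minimal sketch of that variant follows, using made-up rates for one treated and one comparison state (the numbers are hypothetical, not the Arkansas and Mississippi values). It also shows that the same quantity is the interaction coefficient in a simple 2x2 regression.

```r
# Hypothetical uninsured rates (illustrative only, not from the dataset)
toy <- data.frame(
  year    = rep(2012:2015, times = 2),
  treated = rep(c(1, 0), each = 4),
  rate    = c(0.18, 0.17, 0.12, 0.10,   # treated state
              0.19, 0.18, 0.15, 0.14)   # comparison state
)
toy$post <- as.numeric(toy$year >= 2014)

# Group-by-period means
m <- tapply(toy$rate, list(treated = toy$treated, post = toy$post), mean)

# (treated post - treated pre) - (control post - control pre)
did_avg <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same number is the interaction coefficient in a 2x2 regression
fit <- lm(rate ~ treated * post, data = toy)
did_reg <- unname(coef(fit)["treated:post"])

print(c(did_avg = did_avg, did_reg = did_reg))
```

Averaging over several years smooths year-to-year noise, at the cost of hiding how the effect evolves over time, which the event-study plot makes visible.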

Discussion Questions

  • Card/Krueger’s original piece utilized the fact that towns on either side of the Delaware river are likely to be quite similar to one another in terms of demographics, economics, etc. Why is that intuition harder to replicate with this data?

  • Answer: The administrative division along the Delaware River offers a natural experiment. The river’s location can be considered effectively random, and towns on either side are broadly comparable. As a result, treatment assignment based on differing administrative jurisdictions across the river can be viewed as quasi-random. However, the dataset in question does not satisfy the condition of random treatment assignment, as each state independently decided whether and when to expand Medicaid. In other words, selection into treatment is an issue. Confounding factors such as political climate and economic growth may influence both the likelihood of Medicaid adoption (the treatment) and changes in the uninsured rate (the outcome), thereby introducing potential bias.

  • What are the strengths and weaknesses of using the parallel trends assumption in difference-in-differences estimates?

  • Answer: The parallel trends assumption enables causal inference using observational data. As noted above, randomized controlled trials are often infeasible for evaluating policy interventions. By assuming that treated and control groups would have followed similar trends in the absence of treatment, the DID framework controls for unobserved, time-invariant differences between groups as well as for common shocks over time. Because DID estimates treatment effects based on within-group changes, stable characteristics, such as geographic location, baseline demographics, or fixed institutional quality, are differenced out. Moreover, DID accounts for shared temporal influences, such as a nationwide economic recession, by capturing them through the control group’s changes, thereby isolating the treatment effect more accurately.

Nevertheless, the parallel trends assumption is strong and inherently untestable, because it concerns what would have happened to the treated group without treatment. Similar pre-treatment trends between treatment and control groups are reassuring, but they do not guarantee that post-treatment trends would have remained the same in the absence of treatment. In practice, identifying a control group that closely matches the treatment group in trend is challenging. Because the DID framework relies heavily on this assumption, any violation can significantly compromise causal inference.
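One common informal check on pre-treatment trends is to look at whether the treated-minus-control gap drifts over the pre-period. The sketch below regresses that gap on year for made-up data (all values are hypothetical). A slope near zero is consistent with parallel pre-trends, though it says nothing about the unobserved post-treatment counterfactual.

```r
# Hypothetical pre-treatment rates (illustrative only)
pre <- data.frame(
  year    = 2008:2013,
  treated = c(0.180, 0.174, 0.171, 0.165, 0.159, 0.156),
  control = c(0.190, 0.185, 0.180, 0.175, 0.170, 0.165)
)

# Gap between the two series in each pre-treatment year
pre$gap <- pre$treated - pre$control

# A linear trend in the gap; a slope near zero suggests a stable gap
trend_fit <- lm(gap ~ year, data = pre)
gap_slope <- unname(coef(trend_fit)["year"])

print(gap_slope)
```

With only a handful of pre-treatment years, such a check has little power, so it is best read alongside the plot and not as a formal test.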

Synthetic Control

Estimate Synthetic Control

Although several states did not expand Medicaid on January 1, 2014, many did later on. In some cases, a Democratic governor was elected and pushed for a state budget that included the Medicaid expansion, whereas in others voters approved expansion via a ballot initiative. The 2018 election was a watershed moment where several Republican-leaning states elected Democratic governors and approved Medicaid expansion. In cases with a ballot initiative, the state legislature and governor still must implement the results via legislation. For instance, Idaho voters approved a Medicaid expansion in the 2018 election, but it was not implemented in the state budget until late 2019, with enrollment beginning in 2020.

Do the following:

# non-augmented synthetic control
medicaid_expansion %>% filter(Date_Adopted > as.Date("2014-01-01")) %>%
  select(State, Date_Adopted) %>% unique()
## # A tibble: 11 × 2
##    State         Date_Adopted
##    <chr>         <date>      
##  1 Alaska        2015-09-01  
##  2 Idaho         2020-01-01  
##  3 Indiana       2015-02-01  
##  4 Louisiana     2016-07-01  
##  5 Michigan      2014-04-01  
##  6 Montana       2016-01-01  
##  7 Nebraska      2020-10-01  
##  8 New Hampshire 2014-08-15  
##  9 Pennsylvania  2015-01-01  
## 10 Utah          2020-01-01  
## 11 Virginia      2019-01-01
# Pennsylvania
syn_df <- medicaid_expansion %>% 
  filter(State == "Pennsylvania" | is.na(Date_Adopted)) %>%
  mutate(treated = ifelse(State == "Pennsylvania" & year >= 2015, 1, 0))
plot_augsynth_paths <- function(aug_obj, 
                                data, 
                                treated_unit, 
                                outcome_var, 
                                time_var, 
                                treat_time, 
                                unit_var) {

  y_tr <- data %>%
    filter(!!sym(unit_var) == treated_unit) %>%
    arrange(!!sym(time_var)) %>%
    pull(!!sym(outcome_var))
  
  y_hat <- predict(aug_obj, att = FALSE)
  

  time_seq <- data %>%
    filter(!!sym(unit_var) == treated_unit) %>%
    arrange(!!sym(time_var)) %>%
    pull(!!sym(time_var))
  

  plot_df <- tibble(
    time = time_seq,
    treated = y_tr,
    synthetic = y_hat
  )
  
  ggplot(plot_df, aes(x = time)) +
    geom_line(aes(y = treated, colour = "Treated")) +
    geom_line(aes(y = synthetic, colour = "Synthetic")) +
    geom_vline(xintercept = treat_time, linetype = "dashed") +
    scale_colour_manual(values = c("Treated" = "darkred", "Synthetic" = "steelblue")) +
    labs(x = "Time", y = outcome_var, colour = "") +
    theme_minimal()
}
syn_non_arg <- augsynth(data = syn_df,
                        uninsured_rate ~ treated,
                        unit = State,
                        time = year,
                        t_int = 2015,
                        progfunc = "None",
                        scm = TRUE)
## One outcome and one treatment time found. Running single_augsynth.
summary(syn_non_arg)
## 
## Call:
## single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), 
##     t_int = t_int, data = data, progfunc = "None", scm = ..2)
## 
## Average ATT Estimate (p Value for Joint Null):  -0.0215   ( 0.25 )
## L2 Imbalance: 0.026
## Percent improvement from uniform weights: 81.3%
## 
## Avg Estimated Bias: NA
## 
## Inference type: Conformal inference
## 
##  Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value
##  2015   -0.022             -0.065              0.021   0.107
##  2016   -0.020             -0.063              0.024   0.254
##  2017   -0.022             -0.065              0.021   0.120
##  2018   -0.022             -0.065              0.021   0.117
##  2019   -0.023             -0.066              0.020   0.120
##  2020   -0.021             -0.064              0.022   0.118
plot(syn_non_arg)

plot_augsynth_paths(syn_non_arg, 
                                syn_df, 
                                "Pennsylvania", 
                                "uninsured_rate", 
                                "year", 
                                2015, 
                                "State")

The non-augmented synthetic control yields an average ATT of -0.0215, suggesting that Medicaid expansion reduced the uninsured rate in Pennsylvania by about 2.15 percentage points after implementation in 2015. However, the joint p-value of 0.25 indicates that this effect is not statistically significant at conventional levels, and each yearly 95% confidence interval includes zero. The second graph shows that Pennsylvania experienced a sharper decline in uninsured rates after treatment than its synthetic control, although the confidence interval in the first graph reflects considerable uncertainty around the estimate. The L2 imbalance of 0.026 and the 81.3% improvement over uniform weights indicate a fair fit, but the plot shows that synthetic and actual Pennsylvania still diverge noticeably in the pre-treatment period. This is likely due largely to the small dataset.

# augmented synthetic control
# Ridge
syn_ridge <- augsynth(data = syn_df,
                        uninsured_rate ~ treated,
                        unit = State,
                        time = year,
                        t_int = 2015,
                        progfunc = "Ridge",
                        scm = TRUE)
## One outcome and one treatment time found. Running single_augsynth.
summary(syn_ridge)
## 
## Call:
## single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), 
##     t_int = t_int, data = data, progfunc = "Ridge", scm = ..2)
## 
## Average ATT Estimate (p Value for Joint Null):  -0.0151   ( 0.52 )
## L2 Imbalance: 0.010
## Percent improvement from uniform weights: 93.1%
## 
## Avg Estimated Bias: -0.006
## 
## Inference type: Conformal inference
## 
##  Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value
##  2015   -0.015             -0.045              0.015   0.111
##  2016   -0.013             -0.043              0.017   0.119
##  2017   -0.016             -0.046              0.014   0.123
##  2018   -0.015             -0.045              0.015   0.124
##  2019   -0.016             -0.046              0.014   0.123
##  2020   -0.015             -0.045              0.015   0.125
plot(syn_ridge)

plot_augsynth_paths(syn_ridge, 
                                syn_df, 
                                "Pennsylvania", 
                                "uninsured_rate", 
                                "year", 
                                2015, 
                                "State")

In the augmented synthetic control method, a ridge regression outcome model is used to correct for the pre-treatment imbalance that remains between the treated unit and its synthetic control. The ridge penalty limits how far the adjusted weights move away from the original synthetic control weights, which guards against overfitting. In this analysis, the ridge-augmented model yields an estimated average ATT of -1.51 percentage points with a joint p-value of 0.52, indicating no statistically significant effect. The augmented model does achieve a better pre-treatment fit, with an L2 imbalance of 0.010 and a 93.1% improvement over uniform weights, compared with 0.026 and 81.3% for the non-augmented model. Despite the weaker statistical signal, the closer pre-treatment match makes the ridge-augmented counterfactual more credible.
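The fit statistics quoted here can be illustrated with a small worked example. The sketch assumes that the L2 imbalance is the Euclidean distance between the treated unit's pre-treatment outcomes and the weighted donor outcomes, and that the percent improvement is one minus the ratio of that distance to the distance under uniform weights. The donor values and weights are hypothetical, and this is not the augsynth implementation.

```r
# Hypothetical pre-treatment outcomes (rows = years, columns = donor units)
X0 <- cbind(A = c(0.12, 0.11, 0.10),
            B = c(0.08, 0.07, 0.06),
            C = c(0.20, 0.19, 0.18))
x1 <- c(0.10, 0.09, 0.08)        # treated unit, same years

# Euclidean distance between treated and weighted donor outcomes
l2_imbalance <- function(w) sqrt(sum((x1 - as.vector(X0 %*% w))^2))

w_synth   <- c(0.6, 0.4, 0.0)    # illustrative weights summing to 1
w_uniform <- rep(1 / ncol(X0), ncol(X0))

imb_synth   <- l2_imbalance(w_synth)
imb_uniform <- l2_imbalance(w_uniform)

# Share of the uniform-weight imbalance removed by the chosen weights
pct_improvement <- 1 - imb_synth / imb_uniform

print(c(imb_synth = imb_synth, imb_uniform = imb_uniform,
        pct_improvement = pct_improvement))
```

Under these assumptions, a higher percent improvement means the chosen weights track the treated unit's pre-treatment path more closely than a simple average of the donors does.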

data.frame(syn_non_arg$weights) %>%

  tibble::rownames_to_column('State') %>%
  filter(syn_non_arg.weights > 0) %>% 

  ggplot() +
  geom_bar(aes(x = State, 
               y = syn_non_arg.weights),
           stat = 'identity') +
  coord_flip() +   
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  ggtitle('Synthetic Control Weights') +
  xlab('State') +
  ylab('Weight') +
  theme_minimal()

data.frame(syn_ridge$weights) %>%
  tibble::rownames_to_column('State') %>%
 ggplot() +
  geom_bar(aes(x = State, 
               y = syn_ridge.weights),
           stat = 'identity') +
  coord_flip() +  

  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +

  ggtitle('Synthetic Control Weights') +
  xlab('State') +
  ylab('Weight') +
  theme_minimal()

HINT: Is there any preprocessing you need to do before you allow the program to automatically find weights for donor states?

Discussion Questions

  • What are the advantages and disadvantages of synthetic control compared to difference-in-differences estimators?

  • Answer: Unlike the DiD method, synthetic control does not rely on the parallel trends assumption between treated and control units. Instead, it assumes that a weighted combination of control units can closely approximate the counterfactual trend for the treated unit. This approach is particularly useful when no single unit (or simple average of units) provides a suitable comparison. As a result, synthetic control avoids the often subjective or arbitrary selection of control groups that can occur in some DiD applications. Moreover, synthetic control offers greater transparency and interpretability. The weights assigned to each control unit explicitly reveal how the synthetic control is constructed, allowing researchers to identify which units contribute most to the counterfactual estimate. This can yield valuable substantive insights. However, synthetic control is primarily designed for cases with a single or small number of treated units and requires a relatively long panel dataset. It relies on a sufficient number of pre-treatment periods to estimate reliable weights and to achieve a good pre-treatment fit between the treated unit and its synthetic counterpart. Pre-treatment covariates that predict the outcome can also help guide weight selection, although the method can be run on pre-treatment outcomes alone, as in this analysis. In contrast, standard DiD methods require only outcome data for both treated and control groups over time.

  • One of the benefits of synthetic control is that the weights are bounded between [0,1] and the weights must sum to 1. Augmentation might relax this assumption by allowing for negative weights. Does this create an interpretation problem, and how should we balance this consideration against the improvements augmentation offers in terms of imbalance in the pre-treatment period?

  • Answer: In the classic synthetic control, all weights are non-negative and sum to one, resulting in a convex combination of control units. This structure supports a straightforward interpretation: the synthetic control represents a weighted average of actual units from the donor pool. However, when ridge regularization is introduced—as in the augmented synthetic control framework—negative weights may be assigned to some control units in order to reduce imbalance, particularly when pre-treatment trends differ substantially. This compromises the intuitive interpretation of the synthetic control as a blend of real-world analogs. If the standard synthetic control achieves a good pre-treatment fit, it is generally preferable due to its clear interpretability and its interpolation-based nature. However, if the fit is poor, concerns about estimation bias become more pressing. In such cases, an augmented method that substantially improves fit may be more appropriate, even if it sacrifices some interpretability—provided that the results are robust and validated through placebo or sensitivity checks. Clear communication about this trade-off is essential for credible inference.
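The trade-off between interpretability and fit can be seen in a toy example where the treated unit lies outside the range spanned by the donors. The sketch below compares weights constrained to the simplex (non-negative, summing to one) with unconstrained ridge regression weights. Plain ridge regression is used here only as a stand-in for the extrapolation that augmentation permits; it is not the augsynth estimator, and all values are hypothetical.

```r
# Hypothetical pre-treatment outcomes (rows = periods, columns = donors)
X0 <- cbind(A = c(1, 2, 3),
            B = c(2, 2, 2))
x1 <- c(0, 2, 4)                 # treated unit; equals 2*A - 1*B

# Simplex-constrained weights via a softmax reparameterisation
softmax <- function(theta) exp(theta - max(theta)) / sum(exp(theta - max(theta)))
loss    <- function(theta) sum((x1 - X0 %*% softmax(theta))^2)
w_simplex <- softmax(optim(c(0, 0), loss, method = "BFGS")$par)

# Unconstrained ridge weights (tiny penalty), which may leave the simplex
lambda  <- 1e-6
w_ridge <- as.vector(solve(crossprod(X0) + lambda * diag(ncol(X0)),
                           crossprod(X0, x1)))

# Pre-treatment imbalance under each set of weights
imbalance <- function(w) sqrt(sum((x1 - as.vector(X0 %*% w))^2))
print(rbind(simplex = c(w_simplex, imbalance(w_simplex)),
            ridge   = c(w_ridge,   imbalance(w_ridge))))
```

In this example the simplex weights cannot reproduce the treated unit's steeper trend, while the ridge weights match it almost exactly by placing a negative weight on donor B, which is the extrapolation that makes the result harder to read as a blend of real units.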

Staggered Adoption Synthetic Control

Estimate Multisynth

Do the following:

  • Estimate a multisynth model that treats each state individually. Choose a fraction of states that you can fit on a plot and examine their treatment effects.
# multisynth model states
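multisyn_df <- medicaid_expansion %>% 
  mutate(treat_year = as.numeric(format(as.Date(Date_Adopted), "%Y")),
         # never-adopting states have no date; code them as treated at Inf
         # so that `treated` is 0 instead of NA for those states
         treat_year = coalesce(treat_year, Inf),
         treated = 1 * (year >= treat_year))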
multisyn_i <- multisynth(uninsured_rate ~ treated, 
                        State,                      
                        year,                  
                        multisyn_df, 
                        n_leads = 10)
summary(multisyn_i)
## 
## Call:
## multisynth(form = uninsured_rate ~ treated, unit = State, time = year, 
##     data = multisyn_df, n_leads = 10)
## 
## Average ATT Estimate (Std. Error): -0.015  (0.006)
## 
## Global L2 Imbalance: 0.000
## Scaled Global L2 Imbalance: 0.016
## Percent improvement from uniform global weights: 98.4
## 
## Individual L2 Imbalance: 0.005
## Scaled Individual L2 Imbalance: 0.101
## Percent improvement from uniform individual weights: 89.9    
## 
##  Time Since Treatment   Level     Estimate   Std.Error lower_bound  upper_bound
##                     0 Average -0.009736317 0.004679493 -0.01948349 -0.001086420
##                     1 Average -0.016568450 0.006323628 -0.02989597 -0.005489879
##                     2 Average -0.015400010 0.006579639 -0.02907955 -0.003607858
##                     3 Average -0.017866416 0.006830210 -0.03226028 -0.005413554
##                     4 Average -0.019238529 0.006635683 -0.03345751 -0.007344787
##                     5 Average -0.018328885 0.006497176 -0.03165532 -0.006776385
##                     6 Average -0.018096801 0.007031400 -0.03273742 -0.005562136
plot(multisyn_i)
## Joining with `by = join_by(Level)`
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the augsynth package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
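The `treated` indicator built at the start of this section compares `year` with `treat_year`, so any state with a missing `Date_Adopted` ends up with `NA` rather than `0`. The sketch below uses a hypothetical two-state toy panel (not the Medicaid data) to show that behavior next to an explicit alternative that codes never-treated rows as `0`. Whether the two codings lead to different `multisynth` results is not checked here.

```r
# Hypothetical toy panel: one expansion state and one never-treated state.
toy <- data.frame(
  State        = rep(c("Expander", "Holdout"), each = 3),
  year         = rep(2013:2015, times = 2),
  Date_Adopted = as.Date(rep(c("2014-01-01", NA), each = 3))
)
toy$treat_year <- as.numeric(format(toy$Date_Adopted, "%Y"))

# Same construction as in the chunk above: comparisons against NA return NA.
toy$treated_na <- 1 * (toy$year >= toy$treat_year)

# Explicit alternative (an assumption, not the original code):
# never-treated rows are coded 0 instead of NA.
toy$treated <- as.integer(!is.na(toy$treat_year) & toy$year >= toy$treat_year)

toy
```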

multisyn_i_sum <- summary(multisyn_i)

multisyn_i_sum$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        legend.position = "bottom") +
  ggtitle('Synthetic Controls for Medicaid (Individual)') +
  xlab('Time') +
  ylab('Uninsured Rate')
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_line()`).

multisyn_i_sum$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        legend.position = 'none') +
  ggtitle('Synthetic Controls for Medicaid (Individual)') +
  xlab('Time') +
  ylab('Uninsured Rate') +
  facet_wrap(~Level)
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 253 rows containing missing values or values outside the scale range
## (`geom_line()`).
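The prompt asks for a subset of states that fits on one plot, while the plots above draw every level at once. One way to narrow the plot is to filter the `att` table before passing it to `ggplot()`. The sketch below assumes only the column names already used above (`Time`, `Level`, `Estimate`). The data frame `att_toy`, the helper `keep_levels()`, and all numbers are hypothetical stand-ins.

```r
# Hypothetical stand-in for multisyn_i_sum$att; the estimates are made up.
att_toy <- data.frame(
  Time     = rep(-1:1, times = 4),
  Level    = rep(c("Average", "Arkansas", "Nevada", "Ohio"), each = 3),
  Estimate = c(0, -0.01, -0.02,
               0, -0.03, -0.04,
               0, -0.02, -0.03,
               0, -0.01, -0.01)
)

# Keep the pooled average plus a short, hand-picked list of states.
keep_levels <- function(att, states) {
  att[att$Level %in% c("Average", states), , drop = FALSE]
}

att_small <- keep_levels(att_toy, c("Arkansas", "Nevada"))
unique(att_small$Level)
```

With the real object, the same helper could be applied as `keep_levels(multisyn_i_sum$att, c("Arkansas", "Nevada"))` and the result piped into the `ggplot()` calls above.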

  • Estimate a multisynth model using time cohorts. For the purpose of this exercise, you can simplify the treatment time so that states that adopted Medicaid expansion within the same year (i.e. all states that adopted expansion in 2016) count for the same cohort. Plot the treatment effects for these time cohorts.
# multisynth model time cohorts
multisyn_tc <- multisynth(uninsured_rate ~ treated, 
                        State,                      
                        year,                  
                        multisyn_df, 
                        n_leads = 10,
                        time_cohort = TRUE)
summary(multisyn_tc)
## 
## Call:
## multisynth(form = uninsured_rate ~ treated, unit = State, time = year, 
##     data = multisyn_df, n_leads = 10, time_cohort = TRUE)
## 
## Average ATT Estimate (Std. Error): -0.016  (0.005)
## 
## Global L2 Imbalance: 0.001
## Scaled Global L2 Imbalance: 0.007
## Percent improvement from uniform global weights: 99.3
## 
## Individual L2 Imbalance: 0.005
## Scaled Individual L2 Imbalance: 0.015
## Percent improvement from uniform individual weights: 98.5    
## 
##  Time Since Treatment   Level     Estimate   Std.Error lower_bound  upper_bound
##                     0 Average -0.009699016 0.004264554 -0.01855545 -0.001778161
##                     1 Average -0.017346723 0.005765867 -0.02865796 -0.006283963
##                     2 Average -0.015234130 0.005876860 -0.02695995 -0.003403163
##                     3 Average -0.018128760 0.006113397 -0.03046936 -0.006289020
##                     4 Average -0.019448849 0.006046466 -0.03165735 -0.007808059
##                     5 Average -0.018711810 0.005744099 -0.03028272 -0.007823958
##                     6 Average -0.018354004 0.006221218 -0.03059071 -0.006565221
plot(multisyn_tc)
## Joining with `by = join_by(Level)`
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).

multisyn_tc_sum <- summary(multisyn_tc)

multisyn_tc_sum$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        legend.position = "bottom") +
  ggtitle('Synthetic Controls for Medicaid (Time Cohort)') +
  xlab('Time') +
  ylab('Uninsured Rate')
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_line()`).

multisyn_tc_sum$att %>%
  ggplot(aes(x = Time, y = Estimate, color = Level)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = 0) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        legend.position = 'none') +
  ggtitle('Synthetic Controls for Medicaid (Time Cohort)') +
  xlab('Time') +
  ylab('Uninsured Rate') +
  facet_wrap(~Level)
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_line()`).

Discussion Questions

  • One feature of Medicaid is that it is jointly administered by the federal government and the states, and states have some flexibility in how they implement Medicaid. For example, during the Trump administration, several states applied for waivers where they could add work requirements to the eligibility standards (i.e. an individual needed to work for 80 hours/month to qualify for Medicaid). Given these differences, do you see evidence for the idea that different states had different treatment effect sizes?

  • Answer: The multisynth model, which estimates a treatment effect for each state individually, shows clear heterogeneity in effect sizes. Arkansas, Nevada, and New Mexico exhibit relatively large negative treatment effects, indicating substantial reductions in uninsured rates following Medicaid expansion. In contrast, Connecticut, Delaware, and Massachusetts show smaller or negligible effects, likely because of already low baseline uninsured rates or pre-existing coverage expansions. The variation in post-treatment deviations from the synthetic controls indicates that treatment effects are not uniform. The model does not identify the cause of this variation, but the pattern is consistent with the decentralized nature of Medicaid, under which states differ in eligibility criteria, waiver usage (e.g., work requirements), aggressiveness of rollout, and administrative capacity.

  • Do you see evidence for the idea that early adopters of Medicaid expansion enjoyed a larger decrease in the uninsured population?

  • Answer: The multisynth model with time cohorts examines heterogeneity by adoption timing rather than by individual state. For each cohort, the model constructs a weighted combination of control units that minimizes the overall squared error between the synthetic counterfactual and the units in that cohort. States that adopted Medicaid expansion in 2016 show a notable decline in uninsured rates, approximately -0.02 immediately following implementation. In 2017 and 2018, the ATTs exceed -0.03. States that adopted Medicaid expansion later (2019–2020) display more muted effects, and in some cases, particularly in 2020, no clear impact is observed. This may reflect the shorter post-treatment observation window and confounding shocks such as the COVID-19 pandemic. Overall, the estimates point to stronger effects among earlier adopters, which may reflect greater preparedness, larger eligible populations, or more effective outreach efforts.
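Because later cohorts have fewer post-treatment periods, a cohort comparison is easier to read when the average post-treatment estimate is shown next to the number of periods behind it. The sketch below does this on a hypothetical stand-in for `multisyn_tc_sum$att`. The cohort labels in `Level` and all estimates are made up, and the actual labels produced by `multisynth` with `time_cohort = TRUE` may differ.

```r
# Hypothetical stand-in for multisyn_tc_sum$att; labels and estimates are made up.
att_tc_toy <- data.frame(
  Time     = rep(-1:2, times = 2),
  Level    = rep(c("2014", "2016"), each = 4),
  Estimate = c(0.001, -0.010, -0.020, -0.030,
               0.000, -0.005, -0.010, NA)     # NA: period not observed
)

# Keep post-treatment rows that were actually estimated.
post <- subset(att_tc_toy, Time >= 0 & !is.na(Estimate))

# Average post-treatment estimate and number of post periods per cohort.
cohort_avg <- aggregate(Estimate ~ Level, data = post, FUN = mean)
cohort_n   <- aggregate(Estimate ~ Level, data = post, FUN = length)
names(cohort_n)[2] <- "n_post_periods"

cohort_summary <- merge(cohort_avg, cohort_n, by = "Level")
cohort_summary
```

In this toy table the later cohort has both a smaller average effect and fewer post-treatment periods, which is the kind of pattern that makes the shorter observation window a plausible part of the explanation.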

General Discussion Questions